BigQuery

Overview

BigQuery uses SQL as its querying language and provides numerous support functions to query and interact with data. For details see
https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax

Within queries references to tables can be fully qualified using the form

DATASET.TABLE

For example,

SELECT * FROM `myDataset.Person`

Query Context

When Qarbine performs a query it automatically sets a default dataset context. The value is from the Qarbine Data Service’s generic “database” parameter set by the Qarbine administrator. For example, if set to “myDataset” then the following query is valid..

SELECT * FROM Person

Qarbine uses the Google BigQuery Node.js client library. As a result the location in your job or query configuration must be set if your dataset is not in the default US multi-region. All tables referenced in a query must be in the same location as the job. Mixing single-region and multi-region locations will cause errors.The Qarbine Administrator defines the location when configuring the Qarbine Data Service.

For more information on locations see https://cloud.google.com/bigquery/docs/locations

General Data Handling

BigQuery provides a variety of data types for applications to use. They are discussed at
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types

Booleans, dates, numerics, and strings are the expected JavaScript types within Qarbine answer sets.

JSON Handling

BigQuery supports JSON data type columns. A sample table with a JSON column can be defined with

CREATE TABLE myDataset.products (
  id INT64,
  details JSON
);

and populated with

INSERT INTO myDataset.products (id, details)
VALUES
  (1, JSON '{"name": "Book", "price": 12.99, "tags": ["fiction", "bestseller"]}'),
  (2, JSON '{"name": "Pen", "price": 1.99, "tags": ["stationery"]}');

A sample query using BigQuery JSON_VALUE function calls is

SELECT  id,
  JSON_VALUE(details, '$.name') AS name,
  JSON_VALUE(details, '$.price') AS price
FROM   myDataset.products;

Another query using the JSON_QUERY_ARRAY function is shown below.

SELECT id,  JSON_QUERY_ARRAY(details, '$.tags') AS tags
FROM  myDataset.products;

A general query is shown below.

select *
 from products
 limit 25

Running it in BigQuery Studio shows

Running it in the Data Source Designer tools shows

The details of the first row are shown below.

Notice the details is a JSON string and not an object.

Qarbine provides a very flexible and powerful set of functions to manipulate answer sets so they are easier to review and interact with when authoring analytics. It is much easier to interact with genuine JSON objects rather that their string representation. Qarbine pragmas cn be used to accomplish this as shown below.

#pragma convertToObject details
select *
 from products
 limit 25

The Qarbine answer set is now

and the details of the first row are shown below.

Notice the details are now a JSON object with specific fields and values rather than a simple string. A sample template formula for this object is

@current.details.name

and

#details.name

It can be better to have the fields within the detail field up a level so this can be accomplished using the following query specification. The order of the pragma matters.

#pragma convertToObject details
#pragma pullFieldsUp details
select *
 from products
 limit 25

The new answer set looks like the following.

This answer set format provides much easier access to the query result values. Notice the movie details are within the base field. Sample template formulas are

@current.name

and

#name

Vector Searching

BigQuery supports scalable vector search, enabling similarity search over high-dimensional data (such as embeddings from machine learning models). The embeddings used for searching can come from Google Vertex AI or other embedding service. Remember that whatever model is used to obtain the embedding for the search must match that used in the stored data. Queries return items whose embeddings are most similar (nearest neighbors) to a given query embedding, supporting use cases like semantic search and recommendations.

The distance algorithms available for vector search are listed below.

Distance Type	Description
EUCLIDEAN	Standard L2 (straight-line) distance between vectors (default).
COSINE	Measures the cosine of the angle between vectors (similarity).
DOT_PRODUCT	Measures the dot product (projection) of vectors.

When using the VECTOR_SEARCH function, you specify the distance algorithm via the distance_type argument. For example:

SELECT *
FROM VECTOR_SEARCH(
  (SELECT * FROM my_table),
  'embedding_column',
  (SELECT [0.1, 0.2, 0.3, ...] AS embedding_column),
  top_k => 5,
  distance_type => 'COSINE'  -- or 'EUCLIDEAN', 'DOT_PRODUCT'
)

Details on using vector search with Big Query can be found at
https://cloud.google.com/bigquery/docs/vector-search

With the following query

SELECT title, reviewtext,
  VECTOR_SEARCH_DISTANCE AS distance
FROM
  VECTOR_SEARCH(
    (SELECT * FROM movies),
    'reviewEmbedding',
    (SELECT [0.12, -0.45, 0.33] AS reviewEmbedding),
    top_k => 5,
    distance_type => 'COSINE'
  );

an error is raised as

Unrecognized name: title at [2:1]

This is because the output of the VECTOR_SEARCH table function does not provide the columns from the source table as first level columns.

Also, note that if the query embedding dimension does not match that define in the index then the following error is raised.

Dimension column `reviewEmbedding` in the base table does not match the dimension
of column `query_embedding` in the query data.

Qarbine provides macro functions which can obtain embeddings and also generate testing ones. Running this query

SELECT *
FROM
  VECTOR_SEARCH(
    (SELECT * FROM movies),
    'reviewEmbedding',
    (SELECT  [!  json( sampleEmbeddings(1536, 0.01), null, null, true ) !]  AS query_embedding),
    top_k => 5,
    distance_type => 'COSINE'
  )

results in the following.

The last parameter to json(...) indicates not to use quotes when replacing the value in the query specification with the JSON string. This is required to obtain the correct query syntax.

The movie details are within the base object. The first row’s details are shown below.

Notice the movie details are within the base field. A sample template formula is

@current.base.title

and

#base.title

The query portion is not needed in the answer set so it can be removed using this query

select t.base, distance from
(
SELECT *
FROM
  VECTOR_SEARCH(
    (SELECT * FROM movies),
    'reviewEmbedding',
    (SELECT  [!  json( sampleEmbeddings(1536, 0.01), null, null, true ) !]  
       AS reviewEmbedding),
    top_k => 5,
    distance_type => 'COSINE'
  ) 
 ) as t

The new answer set is shown below.

The first row’s details are shown below.

The nesting of the movie details within the base field is not optimal. Next we will pull the values within the base field up to the main level and remove the embedding as well using Qaarbine pragmas. The order of pragmas matters.

#pragma pullFieldsUp base
#pragma deleteFields reviewEmbedding
select t.base, distance from
(
SELECT *
FROM
  VECTOR_SEARCH(
    (SELECT * FROM movies),
    'reviewEmbedding',
    (SELECT  [!  json( sampleEmbeddings(1536, 0.01), null, null, true ) !]
       AS reviewEmbedding),
    top_k => 5,
    distance_type => 'COSINE'
  ) 
 ) as t

The result is shown below in 2 snapshots.

These rows are much easier to view and interact with when using template formulas. Notice the movie details are within the base field. Sample template formulas are

@current.title

and

#title

With BigQuery the line

(SELECT [    [! sampleEmbeddings(1536, 0.01) !]  ] AS query_embedding),

results in an error

{"error":"The query was SELECT *\nFROM\nVECTOR_SEARCH(\n(SELECT * FROM movies),\n'reviewEmbedding', (SELECT [    (0.01, 0.01, ...)  ] AS query_embedding),\ntop_k => 5,\ndistance_type => 'COSINE'\n)"}

This is because the “[ (0.01, 0.01, ...) ]” looks like a tuple to BigQuery and not a single value. Notice the enclosing brackets and within them enclosing parentheses. We want the SQL lines to look like

SELECT [ 0.01, 0.01, ... ] AS reviewEmbedding

One option is

SELECT [! removeBookends( sampleEmbeddings(1536, 0.01), true) !] 
AS reviewEmbedding

and another is

SELECT  [! json( sampleEmbeddings(1536, 0.01), null, null, true ) !]
 AS reviewEmbedding

The last argument of ‘true’ to removeBookends() and json() are needed so that the replaced string is not quoted in the final query sent to BigQuery.

Arrays

BigQuery supports storing homogenous lists of primitive and structure values by using the Array data type. A sample table with an array can be defined with

CREATE TABLE myDataset.fruits (
  id INT64,
  name STRING,
  colors ARRAY<STRING>
);

and populated with

INSERT INTO myDataset.fruits (id, name, colors) VALUES
  (1, 'Apple', ['red', 'green', 'yellow']),
  (2, 'Banana', ['yellow']),
  (3, 'Grape', ['purple', 'green']),
  (4, 'Orange', ['orange']);

In BigQuery Studio running the query

select * from myDataset.fruits 

results in this answer set.

Running the query in the Data Source Designer results in this answer set.

The details of the 2nd row are shown below.

Notice the colors value is a true list.

Using SQL you can unnest the colors array to get one row per color per fruit:

SELECT name, color
FROM fruits,
UNNEST(colors) AS color;

The answer set is shown below , but notice that the size of the answer set can become quite large because of all of the duplicate row values for each single original row.

Array and Struct Handling

BigQuery provide support for inner records by using the Struct data type. A sample table with an array containing a struct can be defined with

CREATE TABLE `myDataset.people` (
  id STRING,
  first_name STRING,
  last_name STRING,
  dob DATE,
  addresses ARRAY<STRUCT<
    status STRING,
    address STRING,
    city STRING,
    state STRING,
    zip STRING,
    numberOfYears STRING
  >>
);

It can be populated using this SQL.

INSERT INTO `myDataset.people` (id, first_name, last_name, dob, addresses)
VALUES
  (
    "1", "John", "Doe", DATE "1968-01-22",
    [
      STRUCT("current", "123 First Avenue", "Seattle", "WA", "11111", "1"),
      STRUCT("previous", "456 Main Street", "Portland", "OR", "22222", "5")
    ]
  ),
  (
    "2", "Jane", "Doe", DATE "1980-10-16",
    [
      STRUCT("current", "789 Any Avenue", "New York", "NY", "33333", "2"),
      STRUCT("previous", "321 Main Street", "Hoboken", "NJ", "44444", "3")
    ]
  );

In BigQuery Studio the following query

select * from `myDataset.people`

results in this answer set shown in 2 snapshots

In The Data Source Designer the answer set looks like the following.

The details of the first row are shown below.

Notice that the Array and its inner Struct data are genuine JSON objects which are easily interacted with from within Qarbine. Legacy tool approaches typically have answer sets of simple strings which then require code to be written to interact with the values for anaklytic purposes.

Qarbine Virtual Queries

There are a few convenience queries which are mainly DBA oriented. These queries are recognized by the Qarbine driver and provide common database information. Any catalog and schema set in the data service definition constrains what is returned. For example, if a catalog is given in the data service, then only schemas in that one catalog are returned.

These virtual query defaults are independent of whatever drop down option is chosen in the Data Source Designer tool. If a specific schema’s information is wanted for example, it must be explicitly given.

Query	Description
list databases	Return a list of visible databases.
list tables [DATABASE]	Return a list of tables. The optional argument may be a database name.
describe tables [DATABASE]	Provide details on all of the tables. This may take a while depending on your database structure.
describe table TABLE	Provide details on the given table.

Troubleshooting

If queries in Qarbine are not returning the expected results then review then using BigQuery Studio. This is accessed in the left hand sidebar using the option highlighted below.

Resources

Additional resources for BigQuery can be found at https://cloud.google.com/bigquery

Sample data can be found at https://cloud.google.com/bigquery/public-data

BigQuery

Overview​

Query Context​

General Data Handling​

JSON Handling​

Vector Searching​

Arrays​

Array and Struct Handling​

Qarbine Virtual Queries​

Troubleshooting​

Resources​